RoCo-News: A Hand Validated Journalistic Corpus of Romanian

Authors

Dan Tufis

Elena Irimia

Abstract

The paper briefly describes the Ro-Co project and, in greater detail, one of its first outcomes, the RoCo-News corpus. Ro-Co is a series of Romanian corpora covering various registers, developed at the Research Institute for Artificial Intelligence of the Romanian Academy. The corpora are planned for public release to the research community as their underlying automatic annotations are validated by human experts. Currently, the automatically created data that has not yet been human-validated covers three registers (News, Literature and Legislation) and consists of more than 35 million lexical tokens. All the corpora of the Ro-Co project are annotated in XML, and the minimal attributes for each lexical token are its morpho-lexical tag and its lemma. Additionally, a lexical token may specify its hyphenation, a sense identifier and a chunk identifier (indicating the syntactic chunk to which the token belongs). Where parallel texts in other languages are available (currently the case for about 80% of the RoCo corpora), sentence and word alignments are also included. A further extension will be the inclusion of dependency relations among the lexical tokens as well as anaphor-antecedent relations.
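The per-token annotation described in the abstract can be pictured with a small sketch. The abstract does not give the actual RoCo schema, so the element name `w`, the attribute names `tag`, `lemma`, `hyph`, `sense` and `chunk`, and all sample values below are hypothetical choices made only for illustration. The sketch assumes that the tag and lemma are mandatory and the other attributes optional, as the abstract states.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical markup: element names, attribute names and values are
# illustrative assumptions, NOT the actual RoCo-News schema.
SAMPLE = """
<s id="s1">
  <w tag="Ncmpry" lemma="copil" chunk="Np#1">copiii</w>
  <w tag="Vmip3p" lemma="citi" chunk="Vp#1" sense="sense-001">citesc</w>
</s>
"""


@dataclass
class Token:
    form: str                          # surface form of the token
    tag: str                           # morpho-lexical tag (mandatory)
    lemma: str                         # lemma (mandatory)
    hyphenation: Optional[str] = None  # optional hyphenation
    sense: Optional[str] = None        # optional sense identifier
    chunk: Optional[str] = None        # optional syntactic chunk identifier


def parse_tokens(xml_text: str) -> List[Token]:
    """Read every <w> element and collect its annotations."""
    root = ET.fromstring(xml_text.strip())
    tokens = []
    for w in root.iter("w"):
        tokens.append(
            Token(
                form=w.text or "",
                tag=w.attrib["tag"],      # KeyError if a mandatory attribute is missing
                lemma=w.attrib["lemma"],
                hyphenation=w.get("hyph"),  # .get returns None when absent
                sense=w.get("sense"),
                chunk=w.get("chunk"),
            )
        )
    return tokens


tokens = parse_tokens(SAMPLE)
for tok in tokens:
    print(tok.form, tok.tag, tok.lemma, tok.chunk, tok.sense)
```

Treating the tag and lemma as required lookups and the remaining attributes as optional mirrors the distinction the abstract draws between minimal and additional attributes; a real reader for the corpus would need to follow the schema published with the release.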

Similar resources

Building and annotating a corpus for the study of journalistic text reuse

In this paper we present the METER Corpus, a novel resource for the study and analysis of journalistic text reuse. The corpus consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers. In some cases the newspaper stories are rewritten from the PA source; in other ...

Competition of Discourses in Journalistic Translation: Diplomatic Negotiations in Focus

We sought to understand whether, how, and why the translated journalistic texts related to the Iranian nuclear negotiations were manipulated. To this end, we monitored a news agency’s Webpage in a time span of 46 days that began 3 days before Almaty I nuclear talks and ended 3 days after Almaty II talks. Monitoring resulted in a corpus made up of 36 target texts p...

Frame Labeling of Competing Narratives in Journalistic Translation

Studying translations during the time of conflict has gained currency in the recent decade in translation studies. One of the cases in which conflict manifests itself is in the way different countries choose to name an event or a geographical location, for example. This study set out to understand how translation of rival names and labeling was carried out in Iranian state-run news agencies. To...

Resolving Romanian Zero Pronouns: A Machine Learning Approach

This paper presents a new study on the distribution, identification, and resolution of zero pronouns in Romanian. A Romanian corpus, including legal, encyclopaedic, literary, and news texts has been created and manually annotated for zero pronouns. Using a morphological parser for Romanian and machine learning methods, experiments were performed on the created corpus for the identification and ...
